Checking model assumptions with check_model() in the training sample
Thinking more deeply about predictions and residuals
Collinearity (correlated predictors)
431 strategy: “most useful” model?
1. Split the data into a development (model training) sample of about 70-80% of the observations, and a holdout (model test) sample containing the remaining observations.
2. Develop candidate models using the development sample.
3. Assess the quality of fit for candidate models within the development sample.
4. Check adherence to regression assumptions in the development sample.
5. When you have candidates, assess them based on the accuracy of the predictions they make for the data held out (and thus not used in building the models).
6. Select a "final" model for use based on the evidence in steps 3, 4, and especially 5.
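The split in step 1 can be sketched in R. This is a minimal illustration using a simulated stand-in data frame (the names `dat`, `train_ids`, and the 70/30 proportion are assumptions for the sketch, not taken from our data):

```r
# Simulated stand-in for a data set like dm500
set.seed(431)
dat <- data.frame(id = 1:500, y = rnorm(500))

# Step 1: draw ~70% of rows for development, hold out the rest
train_ids <- sample(dat$id, size = round(0.7 * nrow(dat)))
dat_train <- dat[dat$id %in% train_ids, ]
dat_test  <- dat[!(dat$id %in% train_ids), ]

nrow(dat_train)  # 350
nrow(dat_test)   # 150
```

Sampling the subject IDs (rather than row positions) makes it easy to verify that no subject lands in both samples.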
R Packages
knitr::opts_chunk$set(comment = NA)

library(janitor)
library(mice)
library(naniar)
library(patchwork)
library(car)       ## for vif function as well as Box-Cox
library(GGally)    ## for ggpairs scatterplot matrix
library(broom)     ## for predictions, residuals with augment
library(easystats)
library(tidyverse)

source("c20/data/Love-431.R")

theme_set(theme_bw())
Today’s Data
The dm500.Rds data contains four important variables + Subject ID on 500 adults with diabetes.
We want to predict the subject’s current Hemoglobin A1c level (a1c), using (up to) three predictors:
a1c_old: subject’s Hemoglobin A1c (in %) two years ago
age: subject’s age in years (between 30 and 70)
income: median income of subject’s neighborhood (3 levels)
What roles will these variables play?
a1c is our outcome, which we’ll predict using three models …
Model 1: Use a1c_old alone to predict a1c
Model 2: Use a1c_old and age together to predict a1c
Model 3: Use a1c_old, age, and income together to predict a1c
Our fit1 model estimates the value of (100/A1c) to be 20.97 if a1c_old = 0, but we have no values of a1c_old anywhere near 0 in our data, nor is such a value plausible clinically, so the intercept doesn't tell us much here.
Our fit1 model estimates the slope of A1c_old to be -0.98, with 95% CI (-1.12, -0.85). Interpret the point estimate…
Suppose Harry has an A1c_old value that is one percentage point (A1c is measured in percentage points) higher than Sally’s. Our fit1 model predicts the 100/A1c value for Harry to be 0.98 less than that of Sally, on average.
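The "Harry vs. Sally" interpretation restates what a slope is: two subjects who differ by one unit in the predictor differ by exactly the slope in their fitted values. A quick check of this, using a built-in data set (mtcars here is purely an illustration, not our diabetes data):

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Predictions for two hypothetical cars whose weights differ by 1 unit
pred <- predict(fit, newdata = data.frame(wt = c(3, 4)))

# The difference in predictions equals the fitted slope
diff(pred)        # identical to coef(fit)["wt"]
coef(fit)["wt"]
```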
Our fit1 model estimates the slope of A1c_old to be -0.98, with 95% CI (-1.12, -0.85). What can we say about the CI?
The confidence interval reflects imprecision in our estimate of the population slope, based only on the assumption that the participants were selected at random from the population of interest.
Interpreting the fit1 equation (4/6)
Slope of A1c_old in fit1 is -0.98, with 95% CI (-1.12, -0.85).
Model fit1 estimates a slope of -0.98 in study participants.
When we generalize beyond study participants to the population they were selected at random from, our data are compatible (at the 95% confidence level) with population slopes between -1.12 and -0.85, assuming the underlying linear model fit1 is correct.
Interpreting the fit1 equation (5/6)
Practically, are our data a random sample of anything?
Slope of A1c_old in fit1 is -0.98, with 95% CI (-1.12, -0.85).
Our 95% confidence interval suggests that our data appear compatible with population slope values for A1c_old between -1.12 and -0.85, assuming the participants are representative of the population of interest, and assuming the underlying linear model fit1 is correct.
Interpreting the fit1 equation (6/6)
Can we say “There is 95% probability that the population slope lies between …”?
To find such a probability interval, we'd need to combine our confidence interval (giving compatibility of data with population slope values) with meaningful prior information on which values for the population slope are plausible.
Suppose Harry is one year older than Sally and they have the same A1c_old. On average, our fit2 model predicts Harry’s 100/A1c value to be 0.02 higher than Sally’s.
Suppose Harry and Sally are the same age and have the same A1c_old, but Harry lives in a neighborhood with income < $30K, while Sally lives in a neighborhood with income > $50K. On average, our fit3 model predicts Harry’s 100/A1c value to be 0.21 lower than Sally’s.
.resid = residual (observed - fitted outcome) values; larger residuals (positive or negative) mean poorer fit
.std.resid = standardized residuals (residuals scaled to SD = 1, remember residual mean is already 0)
What does augment give us?
aug1 also includes:
.hat statistic = measures leverage (larger values of .hat indicate unusual combinations of predictor values)
.cooksd = Cook’s distance (or Cook’s d), a measure of the subject’s influence on the model (larger Cook’s d values indicate that removing the point will materially change the model’s coefficients)
plus .sigma = estimated \(\sigma\) if this point is dropped from the model
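Each of these augment columns has a base-R counterpart, which can help demystify them. A sketch with a built-in data set (mtcars, an illustration only, not our diabetes data):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

h  <- hatvalues(fit)       # leverage (.hat)
r  <- resid(fit)           # residuals (.resid)
rs <- rstandard(fit)       # standardized residuals (.std.resid)
cd <- cooks.distance(fit)  # influence (.cooksd)

# Least squares residuals always average to zero
mean(r)

# Leverages sum to the number of estimated coefficients (here, 3)
sum(h)
```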
transa1c .fitted .resid .hat
Min. : 5.988 Min. : 4.963 Min. :-6.84847 Min. :0.002859
1st Qu.:11.528 1st Qu.:12.720 1st Qu.:-1.34925 1st Qu.:0.003037
Median :13.699 Median :13.800 Median : 0.04099 Median :0.003831
Mean :13.360 Mean :13.360 Mean : 0.00000 Mean :0.005714
3rd Qu.:15.385 3rd Qu.:14.585 3rd Qu.: 1.57236 3rd Qu.:0.005543
Max. :23.256 Max. :16.844 Max. : 7.87436 Max. :0.067183
.sigma .cooksd .std.resid
Min. :2.273 Min. :0.0000000 Min. :-2.975936
1st Qu.:2.309 1st Qu.:0.0001615 1st Qu.:-0.585759
Median :2.311 Median :0.0009335 Median : 0.017804
Mean :2.309 Mean :0.0034349 Mean : 0.000472
3rd Qu.:2.312 3rd Qu.:0.0029273 3rd Qu.: 0.685871
Max. :2.313 Max. :0.1330338 Max. : 3.447828
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferroni p
198 3.503225 0.00051981 0.18193
A studentized residual is just another way to standardize the residuals that has some useful properties here.
No indication that having a maximum absolute value of 3.5 in a sample of 350 studentized residuals is a major concern about the Normality assumption, given the Bonferroni p-value = 0.182.
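The Bonferroni calculation behind car::outlierTest() can be sketched by hand: take the largest |studentized residual|, compute its two-sided t-based p-value, then multiply by the number of residuals examined. A self-contained illustration (mtcars again, not our data):

```r
fit <- lm(mpg ~ wt, data = mtcars)
rs  <- rstudent(fit)
n   <- length(rs)

big <- max(abs(rs))   # largest |studentized residual|

# Two-sided p-value from a t distribution with (n - p - 1) df,
# then a Bonferroni correction for having looked at all n residuals
p_unadj <- 2 * pt(big, df = fit$df.residual - 1, lower.tail = FALSE)
p_bonf  <- min(1, n * p_unadj)
```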
augment for models fit2 and fit3
Later, we’ll need the augment results for our other two models: fit2 and fit3.
aug2 <- augment(fit2, data = dm500_i_train)
aug3 <- augment(fit3, data = dm500_i_train)
transa1c .fitted .resid .hat
Min. : 5.988 Min. : 5.128 Min. :-6.7622 Min. :0.002868
1st Qu.:11.528 1st Qu.:12.705 1st Qu.:-1.2724 1st Qu.:0.004416
Median :13.699 Median :13.765 Median : 0.0109 Median :0.006527
Mean :13.360 Mean :13.360 Mean : 0.0000 Mean :0.008571
3rd Qu.:15.385 3rd Qu.:14.593 3rd Qu.: 1.6531 3rd Qu.:0.010009
Max. :23.256 Max. :16.774 Max. : 7.6779 Max. :0.068792
.sigma .cooksd .std.resid
Min. :2.267 Min. :0.0000000 Min. :-2.948425
1st Qu.:2.302 1st Qu.:0.0001642 1st Qu.:-0.557132
Median :2.304 Median :0.0009053 Median : 0.004754
Mean :2.302 Mean :0.0030790 Mean : 0.000599
3rd Qu.:2.305 3rd Qu.:0.0028486 3rd Qu.: 0.720881
Max. :2.305 Max. :0.0940908 Max. : 3.376360
transa1c .fitted .resid .hat
Min. : 5.988 Min. : 5.29 Min. :-6.593697 Min. :0.007099
1st Qu.:11.528 1st Qu.:12.75 1st Qu.:-1.347919 1st Qu.:0.009490
Median :13.699 Median :13.74 Median : 0.003505 Median :0.012570
Mean :13.360 Mean :13.36 Mean : 0.000000 Mean :0.014286
3rd Qu.:15.385 3rd Qu.:14.62 3rd Qu.: 1.674379 3rd Qu.:0.016285
Max. :23.256 Max. :16.81 Max. : 7.810286 Max. :0.075026
.sigma .cooksd .std.resid
Min. :2.269 Min. :0.0000000 Min. :-2.879152
1st Qu.:2.305 1st Qu.:0.0001981 1st Qu.:-0.587438
Median :2.308 Median :0.0011510 Median : 0.001538
Mean :2.306 Mean :0.0029771 Mean : 0.000606
3rd Qu.:2.309 3rd Qu.:0.0030504 3rd Qu.: 0.729660
Max. :2.309 Max. :0.0675136 Max. : 3.435729
Collinearity refers to correlations between predictors.
Strong correlations between predictors can inflate the variance of our estimates, and also make it difficult to distinguish effects attributable to the correlated predictors.
When we have multiple predictors (as in fit2 and fit3) it is worthwhile to quantify the extent to which our variances are inflated by this collinearity.
Sometimes, a correlation matrix or a scatterplot matrix can help…
fit3 adds in a categorical predictor: income category (not shown)
Scatterplot Matrix (from Class 19)
I select the outcome last. Then the bottom row will show the most important scatterplots, with the outcome on the Y axis and each predictor, in turn, on the X axis.
ggpairs() comes from the GGally package.
temp <- dm500_i_train |>
  select(a1c_old, age, income, transa1c)

ggpairs(temp, title = "Scatterplots: Model Development Sample",
        lower = list(combo = wrap("facethist", bins = 10)))
Scatterplot Matrix (from Class 19)
The vif() function
We can estimate the effect of collinearity directly through the vif() function in the car package.
vif(fit3)
GVIF Df GVIF^(1/(2*Df))
a1c_old 1.047487 1 1.023468
age 1.066866 1 1.032892
income 1.045901 2 1.011283
vif(fit2)
a1c_old age
1.036986 1.036986
(Generalized) variance inflation factors above 5 are worthy of our special attention.
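For a numeric predictor, the variance inflation factor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. A hand computation (mtcars, purely illustrative), which with two predictors reduces to using their correlation:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# VIF for wt: regress wt on the other predictor(s)
r2_wt  <- summary(lm(wt ~ hp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)

vif_wt  # matches car::vif(fit)["wt"]
```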
Model fit2 Check for Collinearity
check_model(fit2, check = "vif")
Model fit3 Check for Collinearity
check_model(fit3, check = "vif")
Posterior Predictive Check: fit1
check_model(fit1, check = "pp_check")
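check_model() with check = "pp_check" compares the observed outcome's density with densities of outcomes simulated from the fitted model. The core idea can be sketched in base R with stats::simulate() (mtcars here, for illustration only):

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Draw 50 replicated outcome vectors from the fitted model
set.seed(20)
yrep <- simulate(fit, nsim = 50)

# Overlay the observed density on the simulated densities
plot(density(mtcars$mpg), lwd = 2,
     main = "Observed vs. model-simulated mpg")
for (j in seq_len(ncol(yrep))) lines(density(yrep[[j]]), col = "gray")
```

If the model describes the data well, the observed density should look like one more draw from the cloud of simulated densities.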
fit1: Observed vs. Predicted
ggplot(aug1, aes(x = .fitted, y = transa1c)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, lty = "dashed", col = "magenta")
fit1: Observed and Predicted
aug1 |> reframe(lovedist(transa1c))
# A tibble: 1 × 10
n miss mean sd med mad min q25 q75 max
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 350 0 13.4 2.91 13.7 2.86 5.99 11.5 15.4 23.3
aug1 |> reframe(lovedist(.fitted))
# A tibble: 1 × 10
n miss mean sd med mad min q25 q75 max
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 350 0 13.4 1.77 13.8 1.31 4.96 12.7 14.6 16.8
cor(aug1$transa1c, aug1$.fitted)
[1] 0.609385
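That correlation of about 0.61 between observed and fitted values is directly tied to R²: in a linear model, squaring it reproduces the model's R² (so fit1's R² is roughly 0.37). A quick demonstration with a built-in data set (mtcars, illustrative only):

```r
fit <- lm(mpg ~ wt, data = mtcars)

# cor(observed, fitted)^2 reproduces the model's R-squared
r <- cor(mtcars$mpg, fitted(fit))
r^2
summary(fit)$r.squared
```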
fit1: Predictions and Observed
p1 <- ggplot(aug1, aes(x = .fitted, y = "")) +
  geom_violin() +
  geom_boxplot(fill = "royalblue", alpha = 0.5, width = 0.3) +
  xlim(0, 25) +
  stat_summary(fun = "mean", col = "red") +
  labs(title = "Predicted 100/A1c values (model fit1)", y = "")

p2 <- ggplot(aug1, aes(x = transa1c, y = "")) +
  geom_violin() +
  geom_boxplot(fill = "mistyrose", alpha = 0.5, width = 0.3) +
  xlim(0, 25) +
  stat_summary(fun = "mean", col = "red") +
  labs(title = "Observed 100/A1c values", y = "")

p1 / p2
fit1: Predictions and Observed
Posterior Predictive Check: fit2
check_model(fit2, check = "pp_check")
fit2: Predictions and Observed
p1 <- ggplot(aug2, aes(x = .fitted, y = "")) +
  geom_violin() +
  geom_boxplot(fill = "royalblue", alpha = 0.5, width = 0.3) +
  xlim(0, 25) +
  stat_summary(fun = "mean", col = "red") +
  labs(title = "Predicted 100/A1c values (model fit2)", y = "")

p2 <- ggplot(aug2, aes(x = transa1c, y = "")) +
  geom_violin() +
  geom_boxplot(fill = "mistyrose", alpha = 0.5, width = 0.3) +
  xlim(0, 25) +
  stat_summary(fun = "mean", col = "red") +
  labs(title = "Observed 100/A1c values", y = "")

p1 / p2
fit2: Predictions and Observed
Posterior Predictive Check: fit3
check_model(fit3, check = "pp_check")
fit3: Predictions and Observed
p1 <- ggplot(aug3, aes(x = .fitted, y = "")) +
  geom_violin() +
  geom_boxplot(fill = "royalblue", alpha = 0.5, width = 0.3) +
  xlim(0, 25) +
  stat_summary(fun = "mean", col = "red") +
  labs(title = "Predicted 100/A1c values (model fit3)", y = "")

p2 <- ggplot(aug3, aes(x = transa1c, y = "")) +
  geom_violin() +
  geom_boxplot(fill = "mistyrose", alpha = 0.5, width = 0.3) +
  xlim(0, 25) +
  stat_summary(fun = "mean", col = "red") +
  labs(title = "Observed 100/A1c values", y = "")

p1 / p2
fit3: Predictions and Observed
Model fit1 Checking
check_model(fit1)
Model fit2 Checking
check_model(fit2)
Model fit3 Checking
check_model(fit3)
Coming Soon
Assessing the candidate models more thoroughly, in both the training and test samples
MAPE, RMSPE, Maximum Prediction Error, Validated \(R^2\)
Considering Bayesian alternative fits with weakly informative priors
Incorporating multiple imputation in building a final model